In [1]:
import datashader as ds
import datashader.transfer_functions as tf
import datashader.glyphs
from datashader import reductions
from datashader.core import bypixel
from datashader.utils import lnglat_to_meters as webm, export_image
from datashader.colors import colormap_select, Greys9, viridis, inferno
import copy


from pyproj import Proj, transform
import numpy as np
import pandas as pd
import urllib
import json
import datetime
import colorlover as cl

import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools

from shapely.geometry import Point, Polygon, shape
# In order to get shapely, you'll need to run [pip install shapely] from your terminal

from functools import partial

from IPython.display import GeoJSON

py.init_notebook_mode()

For module 2 we'll be looking at techniques for dealing with big data, in particular binning strategies and the datashader library (which arguably means we'll never need to bin large data for visualization again).

To demonstrate these concepts we'll be looking at the PLUTO dataset put out by New York City's department of city planning. PLUTO contains data about every tax lot in New York City.

PLUTO data can be downloaded from here. Unzip the files to the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it'll come in handy for this assignment.

In [2]:
# Code to read in v17; column names were updated (lowercased) for v18

# bk = pd.read_csv('PLUTO17v1.1/BK2017V11.csv')
# bx = pd.read_csv('PLUTO17v1.1/BX2017V11.csv')
# mn = pd.read_csv('PLUTO17v1.1/MN2017V11.csv')
# qn = pd.read_csv('PLUTO17v1.1/QN2017V11.csv')
# si = pd.read_csv('PLUTO17v1.1/SI2017V11.csv')

# ny = pd.concat([bk, bx, mn, qn, si], ignore_index=True)

ny = pd.read_csv('nyc_pluto_18v2_csv/pluto_18v2.csv', low_memory=False) # low_memory=False reads the whole file before inferring dtypes, avoiding mixed-dtype guesses from chunked parsing


# Getting rid of some outliers
ny = ny[(ny['yearbuilt'] > 1850) & (ny['yearbuilt'] < 2020) & (ny['numfloors'] != 0)]

I'll also do some prep for the geographic component of this data, which we'll be relying on for datashader.

You're not required to know how I'm retrieving the latitude and longitude here, but for those interested: this dataset uses a flat x-y projection (which assumes the area is small enough to treat the world as flat, simplifying calculations), and this needs to be projected back to traditional latitude and longitude.

In [3]:
wgs84 = Proj("+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
nyli = Proj("+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs")
ny['xcoord'] = 0.3048*ny['xcoord']
ny['ycoord'] = 0.3048*ny['ycoord']
ny['lon'], ny['lat'] = transform(nyli, wgs84, ny['xcoord'].values, ny['ycoord'].values)

ny = ny[(ny['lon'] < -60) & (ny['lon'] > -100) & (ny['lat'] < 60) & (ny['lat'] > 20)]

# Defining some helper functions for datashader
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

Part 1: Binning and Aggregation

Binning is a common strategy for visualizing large datasets. It is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form, heatmaps).
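Binning is exactly what a histogram computes under the hood. As a minimal sketch (with made-up years), `np.histogram` exposes those per-bin counts directly:

```python
import numpy as np

# np.histogram returns the count in each bin plus the bin edges.
values = np.array([1901, 1905, 1912, 1918, 1923, 1924, 1925, 1931])
counts, edges = np.histogram(values, bins=[1900, 1910, 1920, 1930, 1940])
# counts is [2, 2, 3, 1]: each bin is half-open [edge, next_edge),
# except the last, which is closed on both ends.
```

A histogram plot is just a bar chart of `counts` against `edges`.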

While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, let's say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:

In [4]:
trace = go.Scatter(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('yearbuilt').count()['bbl'].index,
    y = ny.groupby('yearbuilt').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'Year Built'),
    yaxis = dict(title = 'Number of Lots Built')
)

fig = go.Figure(data = [trace], layout = layout)

py.iplot(fig)

Something looks off... You're going to have to deal with this imperfect data to answer this first question.

But first: some notes on pandas. Pandas dataframes are a different beast than R dataframes; here are some tips to help you get up to speed:



Indexing and Selecting: .loc and .iloc are the analogs for base R subsetting, or filter() in dplyr
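As a quick illustration of the difference, on a hypothetical toy frame:

```python
import pandas as pd

# .loc selects by label, .iloc by integer position; a boolean mask with
# .loc is the rough analog of base-R subsetting or dplyr::filter().
df = pd.DataFrame({'borough': ['BK', 'BX', 'MN'], 'lots': [10, 20, 30]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b', 'lots']       # row labeled 'b' -> 20
by_position = df.iloc[1, 1]          # second row, second column -> 20
filtered = df.loc[df['lots'] > 15]   # rows where lots > 15
```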

Group By: This is the pandas analog to group_by(), and the appended function the analog to summarize(). Try out a few examples of this, and display the results in Jupyter. Take note of what's happening to the indexes: you'll notice that they become hierarchical. Once you perform an aggregation, try running the resulting hierarchical dataframe through a reset_index().

Reset_index: I personally find the hierarchical indexes more of a burden than a help, and this sort of hierarchical indexing leads to a fundamentally different experience compared to R dataframes. reset_index() is a way of restoring a dataframe to a flatter index style. Grouping is where you'll notice it the most, but it's also useful when you filter data, and in a few other split-apply-combine workflows. In pandas, indexes are more meaningful, so use this if you start getting unexpected results.
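A minimal sketch of that workflow, on a hypothetical toy frame:

```python
import pandas as pd

# groupby moves the grouping key into the index; reset_index flattens it
# back into an ordinary column.
df = pd.DataFrame({'decade': [1920, 1920, 1930], 'floors': [2, 3, 10]})

agg = df.groupby('decade')['floors'].mean()  # Series indexed by decade
flat = agg.reset_index()                     # 'decade' is a column again
```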

Indexes are more important in pandas than in R. If you delve deeper into using Python for data science, you'll begin to see the benefits in many places (despite the personal gripes I highlighted above). One place these indexes come in handy is with time series data. The pandas docs have a huge section on datetime indexing. In particular, check out resample, which provides time-series-specific aggregation.
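For instance, a minimal sketch of resample on made-up daily data (assuming a DatetimeIndex):

```python
import pandas as pd

# With a DatetimeIndex, resample gives time-aware aggregation: here,
# daily values rolled up into weekly sums.
s = pd.Series([1, 2, 3, 4],
              index=pd.date_range('2018-01-01', periods=4, freq='D'))

weekly = s.resample('W').sum()  # one bin per week ending on Sunday
```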

Merging, joining, and concatenation: There's some overlap between these different types of merges, so use this as your guide. Concat is a single function that replaces cbind and rbind in R, and the results are driven by the indexes. Read through these examples to get a feel for how these are performed, but you will have to manage your indexes when using these functions. Merges are fairly similar to merges in R, and map similarly to SQL joins.
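A sketch of both on hypothetical borough frames:

```python
import pandas as pd

# concat stacks frames like rbind; ignore_index rebuilds a flat index.
bk = pd.DataFrame({'bbl': [1, 2], 'borough': ['BK', 'BK']})
bx = pd.DataFrame({'bbl': [3], 'borough': ['BX']})
stacked = pd.concat([bk, bx], ignore_index=True)

# merge behaves like a SQL join on the named key column.
codes = pd.DataFrame({'borough': ['BK', 'BX'], 'code': [3, 2]})
joined = stacked.merge(codes, on='borough', how='left')
```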

Apply: This is explained in the "group by" section linked above. These are your analogs to the plyr library in R. Take note of the lambda syntax used here; lambdas are anonymous functions in Python. Rather than predefining a custom function, you can just define it inline.
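A small sketch of the lambda style on a toy frame:

```python
import pandas as pd

# An inline lambda replaces a predefined custom function: here, the
# floor-count spread within each decade group.
df = pd.DataFrame({'decade': [1920, 1920, 1930], 'floors': [2, 6, 10]})
spread = df.groupby('decade')['floors'].apply(lambda s: s.max() - s.min())
```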

Browse through the other sections for some other specifics, in particular reshaping and categorical data (pandas' answer to factors). Pandas can take a while to get used to, but it's a strong framework that makes more advanced work easier once you do. Rolling functions, for example, follow logically from the apply workflow (and led to the best Google results ever when I first tried to find this out and googled "pandas rolling").
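For example, a rolling window follows that same pattern (toy series):

```python
import pandas as pd

# A 3-period moving sum: the result is NaN until the window fills.
s = pd.Series([1, 2, 3, 4, 5])
rolled = s.rolling(window=3).sum()
```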

Google Wes McKinney's book "Python for Data Analysis," which is a cookbook-style intro to pandas. It's an O'Reilly book that should be pretty widely available.


Question

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn't yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings (it should be clear that 20-29-story, 30-39-story, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?).

Binning Data

I spent an insane amount of time trying different methods of binning and grouping data, and had a lot of trouble getting the data into the format needed to plot in plotly. I left a lot of the code below, even though some of it wasn't strictly needed, to show what I tried and to keep notes for future reference.

I binned the data by number of floors and by the decade in which a building was constructed, added those as new columns to the original dataframe, and also created grouped, aggregated dataframes with the same bins.

I discovered after MANY failures that in order to plot the data in plotly, any categorical variables MUST have labels. Unlabeled columns just would not plot, even if converted to a list or series. Indexes also had to be converted to labeled columns if they were categorical.

Number of Floors

In [5]:
# Start your answer here, inserting more cells as you go along

#ny[['numfloors','yearbuilt']][1:5]
#ny.groupby('yearbuilt').count()['bbl'].index

bins1 = [0, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 60, 75, 150, 300]
labels1 = ['1', '2', '3', '4', '5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-60', '61-75', '76-150', '151-300']
floors = pd.cut(ny['numfloors'], bins1, labels=labels1)

ny['numfloors_bins'] = floors
ny['numfloors_bins'].head()
Out[5]:
0    2
1    3
2    2
3    2
4    3
Name: numfloors_bins, dtype: category
Categories (17, object): [1 < 2 < 3 < 4 ... 46-60 < 61-75 < 76-150 < 151-300]
I changed the code above and updated the chart below to test different bin sizes and find appropriate cutoffs.
In [6]:
Floors = pd.DataFrame(ny['numfloors'].groupby(floors).count())
Floors.columns = ['count']
Floors.reset_index(inplace=True)
Floors
Out[6]:
numfloors count
0 1 76698
1 2 451924
2 3 215571
3 4 28764
4 5 16914
5 6-10 17042
6 11-15 2660
7 16-20 1340
8 21-25 542
9 26-30 273
10 31-35 265
11 36-40 144
12 41-45 130
13 46-60 125
14 61-75 26
15 76-150 12
16 151-300 1
Test plot to make sure this works...

I also wanted to get an idea of the distribution of different-sized buildings. The plot makes it clear that it will probably be necessary to log-transform the data to see the distributions later, since the bin for 2-story buildings is more than twice the size of any other and it can't be cut any smaller.
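To see why, compare raw counts spanning several orders of magnitude with their log10 transforms (toy numbers shaped like the counts in the table above):

```python
import numpy as np

# Raw counts differ by factors of thousands; after log10 they fall in a
# narrow 0-6 range, so short bars stay visible next to tall ones.
counts = np.array([451924, 16914, 144, 1])
log_counts = np.log10(counts)
```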

In [7]:
trace = go.Bar(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('numfloors_bins').count()['bbl'].index,
    y = ny.groupby('numfloors_bins').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'numfloors_bins'),
    yaxis = dict(title = 'Number of Lots Built')
)

fig = go.Figure(data = [trace], layout = layout)

py.iplot(fig)

Year Built

In [8]:
bins2 = range(1850, 2021, 10)
labels2 = []
for year in range(1850, 2020, 10):
    labels2.append(str(year) + "'s")
dates = pd.cut(ny['yearbuilt'], bins2, labels=labels2)

ny['yearbuilt_bins'] = dates
ny['yearbuilt_bins'].head()
Out[8]:
0    1920's
1    1930's
2    1970's
3    1910's
4    1920's
Name: yearbuilt_bins, dtype: category
Categories (17, object): [1850's < 1860's < 1870's < 1880's ... 1980's < 1990's < 2000's < 2010's]
I changed the code above and updated the chart below to test different bin sizes and find appropriate cutoffs.
In [9]:
Year_Built = pd.DataFrame(ny['yearbuilt'].groupby(dates).count())
Year_Built.columns = ['count']
Year_Built.reset_index(inplace=True)
Year_Built
Out[9]:
yearbuilt count
0 1850's 127
1 1860's 109
2 1870's 255
3 1880's 619
4 1890's 34483
5 1900's 82746
6 1910's 111565
7 1920's 166195
8 1930's 101321
9 1940's 75961
10 1950's 69893
11 1960's 42867
12 1970's 20947
13 1980's 25385
14 1990's 29697
15 2000's 40060
16 2010's 10201
Another test plot to make sure this works...

You can see a huge building boom in the 1920's!

In [10]:
trace = go.Bar(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('yearbuilt_bins').count()['bbl'].index,
    y = ny.groupby('yearbuilt_bins').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'Year Built'),
    yaxis = dict(title = 'Number of Lots Built')
)

fig = go.Figure(data = [trace], layout = layout)

py.iplot(fig)

Check the new columns in our dataframe...

In [11]:
ny.iloc[:, -2:].head(10)
Out[11]:
numfloors_bins yearbuilt_bins
0 2 1920's
1 3 1930's
2 2 1970's
3 2 1910's
4 3 1920's
5 3 1920's
7 6-10 2000's
8 4 2000's
9 4 1980's
11 5 1890's

Create new aggregated dataframe

After trying unsuccessfully many times to use the new categorical columns that I added to the original ny dataframe, I gave up and created a new dataframe with just the aggregated data for plotting. Even this gave me problems until I finally figured out how to reset the index, here: https://github.com/pandas-dev/pandas/issues/19136

In [12]:
data = pd.DataFrame(ny['bbl'].groupby([floors, dates]).count().unstack())

# data.reset_index(inplace=True)  # throws an error when the index contains a Categorical!

# Workaround from https://github.com/pandas-dev/pandas/issues/19136
def reset_index(df):
    '''Returns DataFrame with index as columns'''
    index_df = df.index.to_frame(index=False)
    df = df.reset_index(drop=True)
    # The order of the dataframes passed to merge matters when the index
    # contains a Categorical:
    # pd.merge(df, index_df, left_index=True, right_index=True) does not work
    return pd.merge(index_df, df, left_index=True, right_index=True)

data = reset_index(data)
data
Out[12]:
numfloors 1850's 1860's 1870's 1880's 1890's 1900's 1910's 1920's 1930's 1940's 1950's 1960's 1970's 1980's 1990's 2000's 2010's
0 1 3.0 2.0 4.0 17.0 385.0 1507.0 3789.0 7582.0 13612.0 14982.0 18099.0 8736.0 3093.0 1852.0 1376.0 1156.0 503.0
1 2 11.0 49.0 68.0 150.0 11625.0 40174.0 59274.0 93462.0 56594.0 56169.0 40869.0 23061.0 13067.0 14199.0 18416.0 20431.0 4305.0
2 3 42.0 31.0 76.0 247.0 14622.0 28371.0 37076.0 52635.0 24421.0 3741.0 9119.0 9074.0 3924.0 8311.0 8751.0 13055.0 2075.0
3 4 48.0 16.0 34.0 91.0 3615.0 6044.0 4430.0 5213.0 3812.0 189.0 227.0 194.0 188.0 304.0 545.0 2459.0 1355.0
4 5 14.0 8.0 46.0 74.0 2868.0 4321.0 4124.0 2957.0 735.0 119.0 106.0 84.0 87.0 71.0 129.0 767.0 404.0
5 6-10 8.0 2.0 26.0 37.0 1281.0 1956.0 2336.0 3212.0 1950.0 643.0 1077.0 927.0 261.0 254.0 324.0 1626.0 1122.0
6 11-15 1.0 1.0 1.0 3.0 65.0 323.0 427.0 583.0 78.0 75.0 167.0 236.0 85.0 112.0 52.0 260.0 191.0
7 16-20 NaN NaN NaN NaN 15.0 34.0 67.0 372.0 77.0 25.0 155.0 292.0 67.0 56.0 18.0 101.0 61.0
8 21-25 NaN NaN NaN NaN 6.0 8.0 22.0 104.0 11.0 9.0 41.0 130.0 34.0 48.0 18.0 56.0 55.0
9 26-30 NaN NaN NaN NaN 1.0 6.0 12.0 25.0 6.0 5.0 15.0 42.0 37.0 43.0 14.0 36.0 31.0
10 31-35 NaN NaN NaN NaN NaN 1.0 3.0 21.0 9.0 1.0 9.0 39.0 52.0 42.0 25.0 34.0 29.0
11 36-40 NaN NaN NaN NaN NaN NaN 1.0 12.0 2.0 1.0 4.0 20.0 22.0 36.0 11.0 22.0 13.0
12 41-45 NaN NaN NaN NaN NaN 1.0 2.0 9.0 5.0 1.0 3.0 19.0 16.0 28.0 5.0 23.0 18.0
13 46-60 NaN NaN NaN NaN NaN NaN NaN 7.0 4.0 1.0 2.0 13.0 14.0 26.0 12.0 28.0 18.0
14 61-75 NaN NaN NaN NaN NaN NaN 1.0 1.0 3.0 NaN NaN NaN NaN 2.0 1.0 2.0 16.0
15 76-150 NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN 1.0 NaN 4.0 5.0
16 151-300 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Plot data for each year

After an insane amount of time trying all different kinds of plots and tweaking the labels, this is the best I could come up with so far. I still can't figure out how to fix a few things that are bugging me:

  1. The grid should have axis labels along the bottom and down the left side, but when I tried to add them they wound up on the first plot only (I removed them since they weren't working right).
  2. Can't figure out how to leave the last spot in the grid blank; I ended up padding the trace list with a repeat of the first trace so the fill loop wouldn't run out.
  3. I'm not loving the log transformation, but it's harder to see the bars for the taller buildings with the non-transformed data.
  4. Can't figure out how to get ALL the plots to share the same y-axis, not just the plots on each row. As a result it's difficult to see the building boom in the 20's; all the plots look like they have similar maximums if you aren't paying attention to the axes.
  5. I also would have liked the plots to go three across, but with 3 columns every other x-axis label was missing (I guess the labels would have run into each other) and I couldn't figure out how to change the label text size.
In [13]:
fig = tools.make_subplots(cols = 2, rows = 9, print_grid = False, shared_yaxes=True, 
                          subplot_titles=(data.columns[1:]))

traces = []
for y in labels2:
    traces.append({'type' : 'bar',
                   'name' : y,
                   'x' : data['numfloors'].tolist(),
                   'y' : np.log10(data[y]).tolist()
                  })
traces.append(traces[0])  # pad to 18 traces so the 9x2 grid fills; this duplicate is the repeated first trace in the last subplot

t = 0
for r in range(1, 10):
    for c in range(1, 3):
        fig.append_trace(trace = traces[t], row = r, col = c)
        t += 1

fig['layout'].update(height=1500, width=800, title='Number of Floors - Log10 Transformed Building Counts', 
                     #xaxis = dict(title = 'Number of Floors'), yaxis = dict(title = 'Number of Lots Built'),
                     showlegend = False)

#fig = go.Figure(data = [traces[2]], layout = layout)

py.iplot(fig)
In [14]:
#help(tools.make_subplots)
fig
Out[14]:
Figure({
    'data': [{'name': "1850's",
              'type': 'bar',
              'uid': '43d791ee-5c67-4c50-a90c-2b1656c7633f',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x',
              'y': [0.47712125471966244, 1.0413926851582251, 1.6232492903979006,
                    1.6812412373755872, 1.146128035678238, 0.9030899869919435, 0.0,
                    nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
              'yaxis': 'y'},
             {'name': "1860's",
              'type': 'bar',
              'uid': '56c4e6ff-3325-48c8-b2eb-1c6ba6cc3237',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x2',
              'y': [0.3010299956639812, 1.6901960800285136, 1.4913616938342726,
                    1.2041199826559248, 0.9030899869919435, 0.3010299956639812,
                    0.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
              'yaxis': 'y'},
             {'name': "1870's",
              'type': 'bar',
              'uid': '37f99a2d-aa8e-436f-8b54-73d9f649cf3d',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x3',
              'y': [0.6020599913279624, 1.8325089127062364, 1.8808135922807914,
                    1.5314789170422551, 1.662757831681574, 1.414973347970818, 0.0,
                    nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
              'yaxis': 'y2'},
             {'name': "1880's",
              'type': 'bar',
              'uid': '14156a16-0912-473f-86d0-c79828750636',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x4',
              'y': [1.2304489213782739, 2.1760912590556813, 2.392696953259666,
                    1.9590413923210936, 1.8692317197309762, 1.568201724066995,
                    0.47712125471966244, nan, nan, nan, nan, nan, nan, nan, nan,
                    nan, nan],
              'yaxis': 'y2'},
             {'name': "1890's",
              'type': 'bar',
              'uid': 'f2f032f9-a12b-4ae2-9de5-82a517d5edbc',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x5',
              'y': [2.5854607295085006, 4.065392961561992, 4.165006779568368,
                    3.5581083016305497, 3.4575791469957626, 3.1075491297446862,
                    1.8129133566428555, 1.1760912590556813, 0.7781512503836436,
                    0.0, nan, nan, nan, nan, nan, nan, nan],
              'yaxis': 'y3'},
             {'name': "1900's",
              'type': 'bar',
              'uid': '351f0004-1cf9-4922-843b-da694ca4d350',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x6',
              'y': [3.1781132523146316, 4.603945075232843, 4.452874643755949,
                    3.781324455666988, 3.63558426631123, 3.291368850451583,
                    2.509202522331103, 1.5314789170422551, 0.9030899869919435,
                    0.7781512503836436, 0.0, nan, 0.0, nan, nan, nan, nan],
              'yaxis': 'y3'},
             {'name': "1910's",
              'type': 'bar',
              'uid': '00e693f5-278d-4cc7-b26b-fed1abae365b',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x7',
              'y': [3.578524605274993, 4.772864235815214, 4.569092873486665,
                    3.6464037262230695, 3.6153186566114788, 3.3684728384403617,
                    2.630427875025024, 1.8260748027008264, 1.3424226808222062,
                    1.0791812460476249, 0.47712125471966244, 0.0,
                    0.3010299956639812, nan, 0.0, nan, 0.0],
              'yaxis': 'y4'},
             {'name': "1920's",
              'type': 'bar',
              'uid': 'ba1de5dc-abaf-4186-a95f-c721260d0bec',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x8',
              'y': [3.8797837800904156, 4.970635070286307, 4.721274627271088,
                    3.717087724927019, 3.4708513245261177, 3.5067755366066433,
                    2.765668554759014, 2.5705429398818973, 2.0170333392987803,
                    1.3979400086720377, 1.3222192947339193, 1.0791812460476249,
                    0.9542425094393249, 0.8450980400142568, 0.0, nan, nan],
              'yaxis': 'y4'},
             {'name': "1930's",
              'type': 'bar',
              'uid': '71658560-2e57-45bc-befd-1e1838c890ec',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x9',
              'y': [4.133921940423772, 4.752770390463606, 4.387763443621556,
                    3.5811528919662887, 2.8662873390841948, 3.290034611362518,
                    1.8920946026904804, 1.8864907251724818, 1.0413926851582251,
                    0.7781512503836436, 0.9542425094393249, 0.3010299956639812,
                    0.6989700043360189, 0.6020599913279624, 0.47712125471966244,
                    0.3010299956639812, nan],
              'yaxis': 'y5'},
             {'name': "1940's",
              'type': 'bar',
              'uid': '0e72d49a-8e8d-45cd-a63a-188cf4f26fb3',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x10',
              'y': [4.175569792734992, 4.749496692020208, 3.572987708198205,
                    2.2764618041732443, 2.0755469613925306, 2.808210972924222,
                    1.8750612633917, 1.3979400086720377, 0.9542425094393249,
                    0.6989700043360189, 0.0, 0.0, 0.0, 0.0, nan, nan, nan],
              'yaxis': 'y5'},
             {'name': "1950's",
              'type': 'bar',
              'uid': '56c1cdf2-5b1d-40f1-8740-28a283acdc32',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x11',
              'y': [4.257654580036067, 4.611394011340291, 3.9599472157084987,
                    2.3560258571931225, 2.0253058652647704, 3.0322157032979815,
                    2.2227164711475833, 2.1903316981702914, 1.6127838567197355,
                    1.1760912590556813, 0.9542425094393249, 0.6020599913279624,
                    0.47712125471966244, 0.3010299956639812, nan, nan, nan],
              'yaxis': 'y6'},
             {'name': "1960's",
              'type': 'bar',
              'uid': 'eff4034e-c675-40fa-bd41-121af8678279',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x12',
              'y': [3.941312625360662, 4.36287813578893, 3.957798774929998,
                    2.287801729930226, 1.9242792860618816, 2.967079734144497,
                    2.3729120029701067, 2.4653828514484184, 2.113943352306837,
                    1.6232492903979006, 1.591064607026499, 1.3010299956639813,
                    1.2787536009528289, 1.1139433523068367, nan, nan, nan],
              'yaxis': 'y6'},
             {'name': "1970's",
              'type': 'bar',
              'uid': '66855430-9207-45c7-9024-4a1c727b95fe',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x13',
              'y': [3.490379920003179, 4.116175891100349, 3.593728998707911,
                    2.27415784926368, 1.9395192526186185, 2.416640507338281,
                    1.9294189257142926, 1.8260748027008264, 1.5314789170422551,
                    1.568201724066995, 1.7160033436347992, 1.3424226808222062,
                    1.2041199826559248, 1.146128035678238, nan, nan, nan],
              'yaxis': 'y7'},
             {'name': "1980's",
              'type': 'bar',
              'uid': '92d629d0-146f-4e9f-a0c0-c2aa456bbd84',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x14',
              'y': [3.2676409823459154, 4.152257759187656, 3.9196532823103643,
                    2.482873583608754, 1.8512583487190752, 2.404833716619938,
                    2.0492180226701815, 1.7481880270062005, 1.6812412373755872,
                    1.6334684555795864, 1.6232492903979006, 1.5563025007672873,
                    1.4471580313422192, 1.414973347970818, 0.3010299956639812, 0.0,
                    nan],
              'yaxis': 'y7'},
             {'name': "1990's",
              'type': 'bar',
              'uid': 'de87d5fe-6ecd-4a75-b494-c1e8c9bd8f02',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x15',
              'y': [3.1386184338994925, 4.265195306285716, 3.942057683841395,
                    2.7363965022766426, 2.110589710299249, 2.510545010206612,
                    1.7160033436347992, 1.255272505103306, 1.255272505103306,
                    1.146128035678238, 1.3979400086720377, 1.0413926851582251,
                    0.6989700043360189, 1.0791812460476249, 0.0, nan, nan],
              'yaxis': 'y8'},
             {'name': "2000's",
              'type': 'bar',
              'uid': '049c3ce6-fe68-4dc1-8dd7-639acb87d78c',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x16',
              'y': [3.0629578340845103, 4.310289623796099, 4.1157768761589635,
                    3.390758528738717, 2.884795363948981, 3.2111205412580492,
                    2.4149733479708178, 2.0043213737826426, 1.7481880270062005,
                    1.5563025007672873, 1.5314789170422551, 1.3424226808222062,
                    1.3617278360175928, 1.4471580313422192, 0.3010299956639812,
                    0.6020599913279624, nan],
              'yaxis': 'y8'},
             {'name': "2010's",
              'type': 'bar',
              'uid': '69b7b7c6-3e7f-446f-b783-043943d01b45',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x17',
              'y': [2.7015679850559273, 3.6339731557896737, 3.3170181010481117,
                    3.1319392952104246, 2.606381365110605, 3.0499928569201424,
                    2.2810333672477277, 1.7853298350107671, 1.7403626894942439,
                    1.4913616938342726, 1.462397997898956, 1.1139433523068367,
                    1.255272505103306, 1.255272505103306, 1.2041199826559248,
                    0.6989700043360189, nan],
              'yaxis': 'y9'},
             {'name': "1850's",
              'type': 'bar',
              'uid': '564422b9-a038-4fbc-a8c9-2c7d48675e57',
              'x': [1, 2, 3, 4, 5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40,
                    41-45, 46-60, 61-75, 76-150, 151-300],
              'xaxis': 'x18',
              'y': [0.47712125471966244, 1.0413926851582251, 1.6232492903979006,
                    1.6812412373755872, 1.146128035678238, 0.9030899869919435, 0.0,
                    nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
              'yaxis': 'y9'}],
    'layout': {'annotations': [{'font': {'size': 16},
                                'showarrow': False,
                                'text': "1850's",
                                'x': 0.225,
                                'xanchor': 'center',
                                'xref': 'paper',
               [figure layout dump truncated: decade annotations "1860's" through "2010's" and paired xaxis/yaxis domains for a 9-row x 2-column subplot grid; height 1500, width 800, title 'Number of Floors - Log10 Transformed Building Counts']
})

Part 2: Datashader

Datashader is a library from Anaconda that does away with the need to bin data before plotting. It takes in all of your datapoints and, based on the canvas size and value ranges, performs a pixel-by-pixel calculation to produce the best representation of the data. In short, it eliminates manual binning entirely.

As an example, let's continue with our question above and look at a 2D histogram of YearBuilt vs NumFloors:

In [15]:
yearbins = 200
floorbins = 200

yearBuiltCut = pd.cut(ny['yearbuilt'], np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
# np.logspace expects base-10 exponents, so use np.log10 (not np.log) for the upper
# bound, and start at 0 (10**0 = 1 floor) so buildings under 10 floors aren't dropped
numFloorsCut = pd.cut(ny['numfloors'], np.logspace(0, np.log10(ny['numfloors'].max()), floorbins))

xlabels = np.floor(np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
ylabels = np.floor(np.logspace(0, np.log10(ny['numfloors'].max()), floorbins))

data = [
    go.Heatmap(z = ny.groupby([numFloorsCut, yearBuiltCut])['bbl'].count().unstack().fillna(0).values,
              colorscale = 'Greens', x = xlabels, y = ylabels)
]

py.iplot(data)

This shows us the distribution, but it's subject to some biases discussed in the Anaconda notebook Plotting Perils.
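That bias is easy to reproduce: with skewed data like floor counts, the apparent shape of a histogram depends heavily on where the bin edges fall. A minimal numpy sketch on synthetic data (not the PLUTO columns):

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed synthetic data standing in for something like 'numfloors'
data = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

# The same data binned two ways: equal-width edges vs. log-spaced edges
linear_counts, _ = np.histogram(
    data, bins=np.linspace(data.min(), data.max(), 20))
log_counts, _ = np.histogram(
    data, bins=np.logspace(np.log10(data.min()), np.log10(data.max()), 20))

# Equal-width edges cram most of the mass into the first bin, hiding any
# structure there; log-spaced edges spread it across many bins
```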

Here is what the same plot would look like in datashader:

In [16]:
cvs = ds.Canvas(800, 500, x_range = (ny['yearbuilt'].min(), ny['yearbuilt'].max()), 
                                y_range = (ny['numfloors'].min(), ny['numfloors'].max()))
agg = cvs.points(ny, 'yearbuilt', 'numfloors')
view = tf.shade(agg, cmap = cm(Greys9), how='log')
export(tf.spread(view, px=2), 'yearvsnumfloors')
Out[16]:

That's technically just a scatterplot, but the points are smartly placed and colored to mimic a heatmap. Depending on pixel size, it will either display individual points or color each pixel by the density of points falling in it.
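The `how='log'` argument is key to that behavior: conceptually, each pixel's aggregate count is normalized on a log scale before being mapped to the colormap, so a pixel with 1000 points reads darker than one with 10, but not 100x darker. A rough numpy sketch of that normalization (an illustration of the idea, not datashader's actual implementation):

```python
import numpy as np

# A toy 4x4 grid of per-pixel point counts, like cvs.points() would produce
agg = np.array([[0,    1,   0, 2],
                [0,   10, 100, 1],
                [1, 1000,  50, 0],
                [0,    2,   1, 0]], dtype=float)

# Log-normalize the nonzero counts to (0, 1]; empty pixels stay unshaded
scaled = np.zeros_like(agg)
nonzero = agg > 0
logs = np.log1p(agg[nonzero])
scaled[nonzero] = logs / logs.max()

# 'scaled' would then index into a colormap such as Greys9
```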

Datashader really shines when looking at geographic information. Here are the latitudes and longitudes of our dataset plotted out, giving us a map of the city colored by density of structures:

In [17]:
NewYorkCity   = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(ny, 'lon', 'lat')
view = tf.shade(agg, cmap = cm(inferno), how='log')
export(tf.spread(view, px=2), 'firery')
Out[17]:

Interestingly, since each point is a tax lot, the large buildings of Manhattan show up as less dense on the map: a single apartment tower is one lot, while blocks of single- and multi-family townhomes contribute many lots each. The densest areas measured by number of lots are therefore the townhome neighborhoods.

Unfortunately, Datashader doesn't have the best documentation. Browse through the examples in their GitHub repo; I would focus on the visualization pipeline and the US Census example for the question below. Feel free to use my samples above as templates when you work on this problem.

Question

You work for a real estate developer and are researching underbuilt areas of the city. After looking in the Pluto data dictionary, you've discovered that all tax assessments consist of two parts: the assessment of the land and the assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on it (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.

Datashader is really cool, but it's not that great at labeling your visualization. Don't worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that's properly correlated with their land value.
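The bivariate classification described above can be sketched in a few lines before running it on the full dataset: bin each value axis into terciles independently, then cross the two 3-level classes into a single 9-level class that maps onto the 3x3 bivariate color grid. A toy example with made-up numbers (hypothetical values, not PLUTO rows):

```python
import pandas as pd

df = pd.DataFrame({
    'land':      [1_000, 5_000, 9_000, 2_000, 8_000, 7_000],
    'structure': [2_000, 2_500, 40_000, 30_000, 1_000, 20_000],
})

labels = ['low', 'med', 'high']
df['land_class'] = pd.qcut(df['land'], 3, labels=labels)
df['struct_class'] = pd.qcut(df['structure'], 3, labels=labels)

# Cross the two 3-level classes into one 9-level land-structure class
df['bivariate'] = (df['land_class'].astype(str) + '-'
                   + df['struct_class'].astype(str))

# 'high-low' = valuable land carrying a cheap structure: a candidate
# underbuilt lot; 'low-high' is the overbuilt mirror image
```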

In [18]:
# The data dictionary says a total assessment is land plus structure,
# so the structure assessment is the total minus the land assessment
ny['assessstructure'] = ny['assesstot'] - ny['assessland']
ny[['assessland', 'assessstructure', 'assesstot']].head(12)
Out[18]:
assessland assessstructure assesstot
0 6787.0 23901.0 30688.0
1 9758.0 24996.0 34754.0
2 67764.0 1289409.0 1357173.0
3 12191.0 26110.0 38301.0
4 13079.0 17377.0 30456.0
5 8341.0 25167.0 33508.0
7 56258.0 1270348.0 1326606.0
8 3761.0 79308.0 83069.0
9 43105.0 189069.0 232174.0
11 68492.0 609174.0 677666.0
12 1680.0 201776.0 203456.0
13 11083.0 82753.0 93836.0
In [19]:
# Bin the data in three equal sized categories for both land value and structure value
val_labels = ['low', 'med', 'high']
ny['land_group'] = pd.qcut(ny['assessland'], 3, labels=val_labels)
ny['structure_group'] = pd.qcut(ny['assessstructure'], 3, labels=val_labels)
ny[['land_group', 'structure_group']].head(12)
Out[19]:
land_group structure_group
0 low med
1 med med
2 high high
3 high med
4 high low
5 med med
7 high high
8 low high
9 high high
11 high high
12 low high
13 med high
In [20]:
ny['land_group'].value_counts()
Out[20]:
med     270827
low     270819
high    270785
Name: land_group, dtype: int64
In [21]:
ny['structure_group'].value_counts()
Out[21]:
low     270828
high    270804
med     270799
Name: structure_group, dtype: int64
In [22]:
# Combine the two binned categories into one 9 category variable
ny['combined_value'] = ny['land_group'].str.cat(ny['structure_group'].values.astype(str), sep='-')
ny['combined_value'] = pd.Categorical(ny['combined_value'])
ny['combined_value'].head(12)
Out[22]:
0       low-med
1       med-med
2     high-high
3      high-med
4      high-low
5       med-med
7     high-high
8      low-high
9     high-high
11    high-high
12     low-high
13     med-high
Name: combined_value, dtype: category
Categories (9, object): [high-high, high-low, high-med, low-high, ..., low-med, med-high, med-low, med-med]

In [23]:
# assign colors to the 9 categories
# land values range from gray to bright pink and structure values range from gray to bright blue
colors = {'low-low': '#e8e8e8',
          'low-med': '#ace4e4',
          'low-high': '#5ac8c8',
          'med-low': '#dfb0d6',
          'med-med': '#a5add3',
          'med-high': '#5698b9',
          'high-low': '#be64ac',
          'high-med': '#8c62aa',
          'high-high': '#3b4994'
}

# Plot the data
NewYorkCity = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(900, 900, *NewYorkCity)
agg = cvs.points(ny, 'lon', 'lat', ds.count_cat('combined_value'))
view = tf.shade(agg, color_key = colors)
export(tf.spread(view, px=2), 'pretty_colors')
/Users/betsyrosalen/anaconda3/lib/python3.6/site-packages/datashader/utils.py:348: FutureWarning:

`Series.cat.categorical` has been deprecated. Use the attributes on 'Series.cat' directly instead.

Out[23]:

Conclusion

Almost everything in Manhattan falls within the high-high (dark blue) category, while the outer boroughs show more variance. Quite a few areas fall in the low-med, low-high, and med-high (teal/blue) categories, indicating some overbuilt areas in Brooklyn and Queens, with far fewer scattered through the Bronx and Staten Island. The North Shore of Staten Island has the largest area in the brightest pink high-low category, indicating underbuilt land, but there are also very large areas in the other pink/purple colors (med-low and high-med) indicating less extremely underbuilt land in the parts of Brooklyn and Queens furthest from Manhattan.

Trying a different approach...

I'm not sure about my method for cutting the data above. Extreme outlier values get hidden by that approach: since I cut the data into three equal-sized groups for low, med, and high, anything far above or far below normal just gets labeled high or low. So I thought I should try cutting the data into thirds based on the max and min values instead...

In [24]:
# Get max values
land_max = ny['assessland'].max()
struc_max = ny['assessstructure'].max()
In [25]:
# Create bins based on max values
land_bins = [0, land_max/3, (land_max/3)*2, land_max]
land_bins
Out[25]:
[0, 1070425350.0, 2140850700.0, 3211276050.0]
In [26]:
struc_bins = [0, struc_max/3, (struc_max/3)*2, struc_max]
struc_bins
Out[26]:
[0, 1248879517.6666667, 2497759035.3333335, 3746638553.0]

I also checked the minimum, and there were a significant number of zeros as well as very low values, so I left zero as the minimum. The problem is that the maximum value is so far above the rest of the values that, binning this way, almost everything falls in the low category for both land and structure value. You can see the few outliers in the tables below.
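The effect is easy to demonstrate in isolation: a single extreme value stretches equal-width bins until everything else lands in "low". A toy sketch:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 3_000_000])  # one extreme outlier
edges = [0, values.max() / 3, 2 * values.max() / 3, values.max()]
groups = pd.cut(values, bins=edges, labels=['low', 'med', 'high'])

# All five ordinary values land in 'low'; only the outlier reaches 'high'
```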

In [27]:
ny['land_group2'] = pd.cut(ny['assessland'], bins=land_bins, labels=val_labels)
ny['structure_group2'] = pd.cut(ny['assessstructure'], bins=struc_bins, labels=val_labels)
ny[['assessland', 'assessstructure', 'structure_group2']][ny['land_group2']=='med']
Out[27]:
assessland assessstructure structure_group2
In [28]:
ny[['assessland', 'assessstructure', 'structure_group2']][ny['land_group2']=='high']
Out[28]:
assessland assessstructure structure_group2
763803 3.211276e+09 3.746639e+09 high
In [29]:
ny[['assessland', 'assessstructure', 'structure_group2']][ny['structure_group2']=='med']
Out[29]:
assessland assessstructure structure_group2
464685 225000000.0 1.349346e+09 med
764302 907958700.0 1.887374e+09 med
In [30]:
ny[['assessland', 'assessstructure', 'structure_group2']][ny['structure_group2']=='high']
Out[30]:
assessland assessstructure structure_group2
763803 3.211276e+09 3.746639e+09 high

Trying a log transformation...

So next I thought a log transformation on the assessment variables might solve the problem...

In [31]:
# Log transform the data, filling NA's and the -inf produced by log(0) with zeros
from numpy import isneginf

log_land = np.log(ny['assessland']).fillna(0)
log_structure = np.log(ny['assessstructure']).fillna(0)
log_land[isneginf(log_land)] = 0
log_structure[isneginf(log_structure)] = 0
/Users/betsyrosalen/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning:

divide by zero encountered in log

/Users/betsyrosalen/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning:

divide by zero encountered in log
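Those divide-by-zero warnings come from taking log(0) for the zero-assessed lots. If rerunning this, the -inf cleanup can be avoided entirely, either with `np.log1p` (which maps 0 to 0 smoothly) or by masking the zeros before taking the log. A small sketch of both alternatives on toy values:

```python
import numpy as np
import pandas as pd

assess = pd.Series([0.0, 100.0, 10_000.0, 0.0, 1_000_000.0])

# Option 1: log1p maps 0 -> 0 with no warning and no -inf
log1p_vals = np.log1p(assess)

# Option 2: mask zeros to NaN, take the log, then fill the NaNs
masked_vals = np.log(assess.mask(assess == 0)).fillna(0)
```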

Visualization of the log transformed land values

Even after log transforming the data, there is still a strong skew in the land values.

In [32]:
trace = go.Histogram(
    x = log_land,
    nbinsx = 100
)

layout = go.Layout(
    xaxis = dict(title = 'log assessland')
)

fig = go.Figure(data = [trace], layout = layout)

py.iplot(fig)

Visualization of the log transformed structure values

Similarly, the structure values are still strongly skewed even after the log transform.

In [34]:
trace = go.Histogram(
    x = log_structure,
    nbinsx = 100
)

layout = go.Layout(
    xaxis = dict(title = 'log assessstructure')
)

fig = go.Figure(data = [trace], layout = layout)

py.iplot(fig)

That's a little better, so let's try binning that...

In [35]:
# Get max log transformed values
log_land_max = log_land.max()
log_struc_max = log_structure.max()
In [36]:
# Create bins based on max log transformed values
log_land_bins = [-10, log_land_max/3, (log_land_max/3)*2, log_land_max+1]
log_land_bins
Out[36]:
[-10, 7.296644739487114, 14.593289478974228, 22.88993421846134]
In [37]:
log_struc_bins = [-10, log_struc_max/3, (log_struc_max/3)*2, log_struc_max+1]
log_struc_bins
Out[37]:
[-10, 7.3480416296893365, 14.696083259378673, 23.04412488906801]
In [38]:
ny['land_group3'] = pd.cut(log_land, bins=log_land_bins, labels=val_labels)
ny['structure_group3'] = pd.cut(log_structure, bins=log_struc_bins, labels=val_labels)
ny[['assessland', 'assessstructure', 'land_group3', 'structure_group3']].head(12)
Out[38]:
assessland assessstructure land_group3 structure_group3
0 6787.0 23901.0 med med
1 9758.0 24996.0 med med
2 67764.0 1289409.0 med med
3 12191.0 26110.0 med med
4 13079.0 17377.0 med med
5 8341.0 25167.0 med med
7 56258.0 1270348.0 med med
8 3761.0 79308.0 med med
9 43105.0 189069.0 med med
11 68492.0 609174.0 med med
12 1680.0 201776.0 med med
13 11083.0 82753.0 med med
In [39]:
ny['combined_value3'] = ny['land_group3'].str.cat(ny['structure_group3'].values.astype(str), sep='-')
ny['combined_value3'] = pd.Categorical(ny['combined_value3'])
ny['combined_value3'].head()
Out[39]:
0    med-med
1    med-med
2    med-med
3    med-med
4    med-med
Name: combined_value3, dtype: category
Categories (8, object): [high-high, high-low, high-med, low-low, low-med, med-high, med-low, med-med]
In [40]:
ny['combined_value3'].value_counts()
Out[40]:
med-med      785041
med-high      11044
low-med        8677
high-high      4440
med-low        2407
low-low         441
high-med        356
high-low         25
Name: combined_value3, dtype: int64

Well, that is not very encouraging. Most of the values now fall in the med-med category, but let's keep going, plot it anyway, and see what we get.

In [41]:
# Plot the data
NewYorkCity = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(900, 900, *NewYorkCity)
agg = cvs.points(ny, 'lon', 'lat', ds.count_cat('combined_value3'))
view = tf.shade(agg, color_key = colors)
export(tf.spread(view, px=2), 'pretty_colors')
/Users/betsyrosalen/anaconda3/lib/python3.6/site-packages/datashader/utils.py:348: FutureWarning:

`Series.cat.categorical` has been deprecated. Use the attributes on 'Series.cat' directly instead.

Out[41]:

Conclusion

So this was worth trying, but it's not very useful or informative. Almost everything falls within the med-med category, with nothing in the bright pink high-low (land-structure) category, or in any of the pink/purple colors, that would indicate underbuilt areas. There are, however, some lots in the teal/blue low-med and med-high categories, indicating some possibly overbuilt areas. Tweaking the bins so that they aren't exact thirds of the maximum value might make this plot more useful.

One more time...

Let's reduce the medium range by just adding 2 to the lower end and subtracting 2 from the higher end.

In [42]:
log_land_bins = [-10, (log_land_max/3)+2, ((log_land_max/3)*2)-2, log_land_max+1]
log_struc_bins = [-10, (log_struc_max/3)+2, ((log_struc_max/3)*2)-2, log_struc_max+1]
ny['land_group4'] = pd.cut(log_land, bins=log_land_bins, labels=val_labels)
ny['structure_group4'] = pd.cut(log_structure, bins=log_struc_bins, labels=val_labels)
ny['combined_value4'] = ny['land_group4'].str.cat(ny['structure_group4'].values.astype(str), sep='-')
ny['combined_value4'] = pd.Categorical(ny['combined_value4'])
ny['combined_value4'].value_counts()
Out[42]:
low-med      421142
med-med      249386
low-low       61958
med-high      39001
high-high     24880
med-low       13669
high-med       1392
low-high        794
high-low        209
Name: combined_value4, dtype: int64
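It's worth noting what that ±2 shift means in dollar terms: adding 2 in natural-log space multiplies a threshold by e² ≈ 7.4. Using the upper edge printed in Out[36] (which was log_land_max + 1), the implied raw-dollar cutoffs for the land groups work out to roughly $11k and $295k:

```python
import numpy as np

log_land_max = 22.88993421846134 - 1  # upper edge from Out[36] minus the +1
low_cut = log_land_max / 3 + 2        # top of the shifted 'low' land group
high_cut = 2 * log_land_max / 3 - 2   # bottom of the shifted 'high' land group

low_dollars = np.exp(low_cut)    # roughly $11,000
high_dollars = np.exp(high_cut)  # roughly $295,000
```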
In [43]:
# Plot the data
NewYorkCity = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(900, 900, *NewYorkCity)
agg = cvs.points(ny, 'lon', 'lat', ds.count_cat('combined_value4'))
view = tf.shade(agg, color_key = colors)
export(tf.spread(view, px=2), 'pretty_colors')
/Users/betsyrosalen/anaconda3/lib/python3.6/site-packages/datashader/utils.py:348: FutureWarning:

`Series.cat.categorical` has been deprecated. Use the attributes on 'Series.cat' directly instead.

Out[43]:

Conclusion

Adjusting the bins gives us a much better picture. Quite a lot more slightly-overbuilt teal blue low-med areas show up all over the outer boroughs, along with some med-high areas in the Bronx. There are still almost no underbuilt areas in pink, except for a few tiny spots scattered around the areas furthest from the city, especially at the Brooklyn-Queens border and on the North Shore of Staten Island. Since the assessment data was so highly skewed, this may be a more accurate representation of the data than our original plot. An image that shows the city as much more overbuilt than underbuilt also seems more likely given what I know about the city, and it shows variation in Manhattan that was completely lost in the original plot.